{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# N-Grams\n",
    "\n",
    "给一系列的词语计算概率的模型叫做**语言模型(Language Models)**，其中，**n-gram**是最简单的一种。一个`n-gram`就是一个长度为`N`的词语组成的序列：\n",
    "\n",
    "* `N=2`，则是`2-gram`(**bigram**)\n",
    "* `N=3`，则是`3-gram`(**trigram**)\n",
    "\n",
    "## 一个简单的例子\n",
    "\n",
    "有一个任务，要计算$P(w\\vert h)$，即给定历史$h$计算$w$的概率。假设$h=its\\ water\\ is\\ so\\ transparent\\ that$，我们要计算下一个词`the`的概率，即：\n",
    "\n",
    "$$P(the\\ \\vert\\ its\\ water\\ is\\ so\\ transparent\\ that)$$\n",
    "\n",
    "一个可行的方式是：**在一个很大的语料库中，统计$its\\ water\\ is\\ so\\ transparent\\ that$出现的次数，然后统计$its\\ water\\ is\\ so\\ transparent\\ that\\ the$的次数，后者除以前者**，即：\n",
    "\n",
    "$$P(the\\ \\vert\\ its\\ water\\ is\\ so\\ transparent\\ that)=\\frac{C(\\ its\\ water\\ is\\ so\\ transparent\\ that\\ the)}{C(\\ its\\ water\\ is\\ so\\ transparent\\ that)}$$\n",
    "\n",
    "这种方式在很多情况下可行。但是某些情况下仍然会有以下问题：\n",
    "\n",
    "* 有些词语在预料中出现次数为0。\n",
    "\n",
    "相似的问题还出现在：如果我们想知道整个序列的**联合概率**，例如$P(its\\ water\\ is\\ so\\ transparent)$，那我们就可以将问题转化为：“在所有5个词语的序列中，`its water is so transparent`出现了几次？”\n",
    "\n",
    "为了解决这个问题，我们需要更好地方式来估计$w$基于$h$的概率，或者整个序列的概率。\n",
    "\n",
    "我们把一个长度为`N`的序列表示为$$w_1,w_2,\\dots,w_n$$，简单表示成$w_1^n$，那么：\n",
    "\n",
    "$$P(w_1^n) = P(w_1)P(w_2\\vert w_1)P(w_3\\vert w_1^2)\\dots P(w_n\\vert w_1^{n-1}) = \\prod_{k=1}^nP(w_k\\vert w_1^{k-1}) $$\n",
    "\n",
    "上面的**链式法则(chain rule)**表面了**联合概率**和**条件概率**之间的联系。\n",
    "\n",
    "但是上面的连乘也带来问题：\n",
    "\n",
    "* 对于很长的序列，计算量很大\n",
    "* 如果其中任何一项概率为0，那么整个联合概率变为0\n",
    "\n",
    "## N-Grams\n",
    "\n",
    "**n-gram**并不是计算某个词基于整个历史的概率，而是用少数几个历史词语来替代整个历史。\n",
    "\n",
    "对于**bigram**模型，我们只计算$P(w_n\\vert w_{n-1})$即可，也就是说，$P(w_n\\vert w_1^{n-1})$退化成了$P(w_n\\vert w_{n-1})$，即：\n",
    "\n",
    "$$P(w_n\\vert w_1^{n-1})\\approx P(w_n\\vert w_{n-1})$$\n",
    "\n",
    "这个**单词的概率只基于前面若干个个单词**假设叫做**马尔科夫假设(Markov assumption)**。\n",
    "\n",
    "因此，对于通常的**n-gram**模型，下一个单词基于前面所有词的条件概率，可以简化为：\n",
    "\n",
    "$$P(w_n\\vert w_1^{n-1})\\approx P(w_n\\vert w_{n-N+1}^{n-1})$$\n",
    "\n",
    "回到**bigram**模型，我们整个序列的联合概率，可以简化成:\n",
    "\n",
    "$$P(w_1^n)\\approx \\prod_{k=1}^nP(w_k\\vert w_{k-1})$$\n",
    "\n",
    "\n",
    "那么我们怎么估计这些**bigram**或者**n-gram**的概率呢？一个直观的方法就是**最大似然估计(maximum likelihood estimation or MLE)**。\n",
    "\n",
    "具体的做法是，选一个预料，进行计数，然后进行归一化。\n",
    "\n",
    "还是以**bigram**为例，\n",
    "\n",
    "$$P(w_n\\vert w_{n-1})=\\frac{C(w_{n-1}w_n)}{\\sum_wC(w_{n-1}w)}$$\n",
    "\n",
    "分母是**所有以$w_{n-1}$开头的词语的总数**，也就是所，可以简化为**$w_{n-1}$出现的次数**！即：\n",
    "\n",
    "$$P(w_n\\vert w_{n-1})=\\frac{C(w_{n-1}w_n)}{C(w_{n-1})}$$\n",
    "\n",
    "在实际的处理中，我们通常会给每一个句子加上**开始标记**和**结束标记**，例如`<s>`和`</s>`。\n",
    "\n",
    "`<s>`是为了让第一个词语有上下文，`</s>`是为了制造一个更加真实的概率分布。\n",
    "> We need the end-symbol to make the bigram grammer a true probability distribution. Without an end-symbol, the sentence probabilities for all sentences of given length would sum to one.\n",
    "\n",
    "## 实际操作的问题和技巧\n",
    "\n",
    "还有一些实际上的问题和处理方法。\n",
    "\n",
    "在`N`比较大的**n-gram**模型中，比如**4-gram**或者**5-gram**，对于句子前面的几个单词，我们没有足够长的历史词，那么我们的做法就是制造几个伪词。\n",
    "\n",
    "例如，在**3-gram**中，对于句子`I love cake.`，我们首先处理成`<s>I love cake.</s>`，那么对于第一个词的条件概率$P(I)$怎么计算呢？答案是使用伪词，`P(I|<s><s>)`。\n",
    "\n",
    "还有就是，我们通常使用**对数概率(log probability)**而不是上面所说的概率，因为对数概率带来两个好处：\n",
    "\n",
    "* 取对数可以把连乘转换成累加，可以加速计算\n",
    "* 可以防止数值溢出\n",
    "\n",
    "$$p_1\\times p_2\\times p_3\\times p_4=\\exp(p_1+p_2+p_3+p_4)$$\n",
    "\n",
    "## 评估语言模型\n",
    "\n",
    "在数据集的划分上，通常的做法是把训练集分成以下三部分：\n",
    "```bash\n",
    ".\n",
    "|-data_set\n",
    "    |----training_set(80%)\n",
    "    |----dev_set     (10%)\n",
    "    |----test_set    (10%)\n",
    "```\n",
    "\n",
    "其中，训练集用来训练模型，验证集用来调整模型的参数，测试集用来评估模型的性能。\n",
    "\n",
    "一个常用的评估方式就是**困惑度(perplexity)**。\n",
    "\n",
    "对于一个测试集$W=w_1w_2\\dots w_N$，困惑度计算如下：\n",
    "\n",
    "$$PP(W)=P(w_1w_2\\dots w_N)^{-\\frac{1}{N}}=\\sqrt[N]{\\frac{1}{P(w_1w_2\\dots w_N)}}$$\n",
    "\n",
    "根据链式法则，有\n",
    "\n",
    "$$PP(W)=\\sqrt[N]{\\prod_{i=1}^N \\frac{1}{P(w_i\\vert w_1\\dots w_{i-1})}}$$\n",
    "\n",
    "特别地，对于**bigram**，上式可以简化为\n",
    "\n",
    "$$PP(W)=\\sqrt[N]{\\prod_{i=1}^N \\frac{1}{P(w_i\\vert w_{i-1})}}$$\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
